Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences
Abstract
Three results are presented. First, we prove the existence of a universal 7-cluster structure in all 143 completely sequenced bacterial genomes available in GenBank in August 2004 and explain its properties. The 7-cluster structure is responsible for the main part of sequence heterogeneity in bacterial genomes. In this sense, the 7-cluster structure is the basic model of a bacterial genome sequence. We demonstrate that four basic "pure" types of this model are observed in nature: "parallel triangles", "perpendicular triangles", the degenerate case, and the flower-like type. Second, we answer the question of how large the position-specific information and the contribution of correlations between nucleotides are. The accuracy of the mean-field (context-free) approximation is estimated for bacterial genomes. We show that the codon usage of bacterial genomes is described with high accuracy by a multi-linear function of their genomic G+C content (more precisely, by two similar functions, one for eubacterial genomes and the other for archaea). The description of these two codon-usage trajectories is the third result. Animated 3D scatter plots of the clusters for all 143 genomes are collected in a database and made available on our web site: http://www.ihes.fr/~zinovyev/7clusters. © 2005 Elsevier B.V. All rights reserved.
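The mean-field (context-free) approximation mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes that the approximation replaces each codon frequency by the product of the nucleotide frequencies at the three codon positions, and that the contribution of correlations between nucleotides is measured as the Kullback-Leibler divergence between the observed and the mean-field codon distributions. All function names and the toy sequences are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product
import math

NUCLEOTIDES = "ACGT"
CODONS = ["".join(c) for c in product(NUCLEOTIDES, repeat=3)]


def codon_frequencies(seq):
    """Frequencies of the 64 in-frame triplets; the frame is assumed to start at index 0."""
    seq = seq.upper()
    triplets = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    # Skip triplets containing ambiguous symbols such as N.
    triplets = [t for t in triplets if set(t) <= set(NUCLEOTIDES)]
    if not triplets:
        raise ValueError("sequence contains no complete unambiguous triplet")
    counts = Counter(triplets)
    total = sum(counts.values())
    return {c: counts[c] / total for c in CODONS}


def position_specific_frequencies(codon_freq):
    """Marginal nucleotide frequencies at codon positions 0, 1 and 2."""
    marginals = [dict.fromkeys(NUCLEOTIDES, 0.0) for _ in range(3)]
    for codon, freq in codon_freq.items():
        for pos, nt in enumerate(codon):
            marginals[pos][nt] += freq
    return marginals


def mean_field_codon_frequencies(marginals):
    """Context-free estimate: product of the three position-specific frequencies."""
    return {
        c: marginals[0][c[0]] * marginals[1][c[1]] * marginals[2][c[2]]
        for c in CODONS
    }


def correlation_information(codon_freq, mean_field):
    """Kullback-Leibler divergence (bits) of observed codon usage from its mean-field estimate."""
    return sum(
        f * math.log2(f / mean_field[c])
        for c, f in codon_freq.items()
        if f > 0
    )


if __name__ == "__main__":
    toy_gene = "ATGGCTGCAAAAGGTGCTGCAAAATAA"  # hypothetical coding sequence
    observed = codon_frequencies(toy_gene)
    estimate = mean_field_codon_frequencies(position_specific_frequencies(observed))
    print(f"information in correlations: {correlation_information(observed, estimate):.3f} bits")
```

A divergence close to zero would indicate that the position-specific nucleotide frequencies alone reproduce the codon usage well; larger values indicate a larger contribution from correlations between nucleotides within a codon.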
Related resources
Four basic symmetry types in the universal 7-cluster structure of 143 complete bacterial genomic sequences
Coding information is the main source of heterogeneity (non-randomness) in the sequences of bacterial genomes. This information can be naturally modeled by analysing cluster structures in the "in-phase" triplet distributions of relatively short genomic fragments (200-400 bp). We found a universal 7-cluster structure in all 143 completely sequenced bacterial genomes available in GenBank in August...
Identification of Synonymous Codon Usage Bias in the Pseudorabies Virus UL31 Gene
Background: Little is known about the synonymous codon usage pattern of the pseudorabies virus (PRV) genome, especially of the UL31 gene, over the course of its evolution. Objectives: In the present study, the codon usage bias between the PRV UL31 sequence and the UL31-like sequences was identified. Materials and Methods: We used a comprehensive analysi...
Visualizing the Spatial Structure of Triplet Distributions in Genetic Texts
We analyze several genetic texts, using visual representations of triplet count distribution in a sliding window. After appropriate normalization and projection onto a linear manifold spanned by the first three principal components, the distribution of 64-dimensional vectors of triplet frequencies appears as a cloud of points, displaying a well-detectable cluster structure. In several complete ...
The mystery of two straight lines in bacterial genome statistics.
In special coordinates (codon position-specific nucleotide frequencies), bacterial genomes form two straight lines in 9-dimensional space: one line for eubacterial genomes, another for archaeal genomes. All 348 distinct bacterial genomes available in GenBank in April 2007 belong to these lines with high accuracy. The main challenge now is to explain the observed high accuracy. The new phen...
Codon Usage Domains over Bacterial Chromosomes
The geography of codon bias distributions over prokaryotic genomes and its impact upon chromosomal organization are analyzed. To this aim, we introduce a clustering method based on information theory, specifically designed to cluster genes according to their codon usage and apply it to the coding sequences of Escherichia coli and Bacillus subtilis. One of the clusters identified in each of the ...